---
title: Machine Learning Analytics with IPython Notebook
---

<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

[IPython Notebook](http://ipython.org/notebook.html) is a very powerful
interactive computational environment, and with
[Apache PredictionIO](http://predictionio.apache.org),
[PySpark](http://spark.apache.org/docs/latest/api/python/) and [Spark
SQL](https://spark.apache.org/sql/), you can easily analyze your collected
events when you are developing or tuning your engine.

## Prerequisites

Before you begin, please make sure you have the latest stable IPython installed,
and that the command `ipython` can be accessed from your shell's search path.

<%= partial 'shared/datacollection/parquet' %>

## Preparing IPython Notebook

Launch IPython Notebook with PySpark using the following command, with
`$SPARK_HOME` replaced by the location of Apache Spark.

```
$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --pylab inline" $SPARK_HOME/bin/pyspark
```
If you see a error appearing in the console like this:

```
[E 10:07:53.900 NotebookApp] Support for specifying --pylab on the command line has been removed.
[E 10:07:53.901 NotebookApp] Please use `%pylab inline` or `%matplotlib inline` in the notebook itself.
```

Then you can use the following command.

```
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --`%pylab inline`" $SPARK_HOME/bin/pyspark
```
By default, you should be able to access your IPython Notebook via web browser
at http://localhost:8888.

Let's initialize our notebook for the following code in the first cell.

```python
import pandas as pd
def rows_to_df(rows):
    return pd.DataFrame(map(lambda e: e.asDict(), rows))
from pyspark.sql import SQLContext
sqlc = SQLContext(sc)
rdd = sqlc.parquetFile("/tmp/movies")
rdd.registerTempTable("events")
```

![Initialization for IPython Notebook](/images/datacollection/ipynb-01.png)

`rows_to_df(rows)` will come in handy when we want to dump the results from
Spark SQL using IPython Notebook's native table rendering.

## Performing Analysis with Spark SQL

If all steps above ran successfully, you should have a ready-to-use analytics
environment by now. Let's try a few examples to see if everything is functional.

In the second cell, put in this piece of code and run it.

```python
summary = sqlc.sql("SELECT "
                   "entityType, event, targetEntityType, COUNT(*) AS c "
                   "FROM events "
                   "GROUP BY entityType, event, targetEntityType").collect()
rows_to_df(summary)
```

You should see the following screen.

![Summary of Events](/images/datacollection/ipynb-02.png)

We can also plot our data, in the next two cells.

```python
import matplotlib.pyplot as plt
count = map(lambda e: e.c, summary)
event = map(lambda e: "%s (%d)" % (e.event, e.c), summary)
colors = ['gold', 'lightskyblue']
plt.pie(count, labels=event, colors=colors, startangle=90, autopct="%1.1f%%")
plt.axis('equal')
plt.show()
```

![Summary in Pie Chart](/images/datacollection/ipynb-03.png)

```python
ratings = sqlc.sql("SELECT properties.rating AS r, COUNT(*) AS c "
                   "FROM events "
                   "WHERE properties.rating IS NOT NULL "
                   "GROUP BY properties.rating "
                   "ORDER BY r").collect()
count = map(lambda e: e.c, ratings)
rating = map(lambda e: "%s (%d)" % (e.r, e.c), ratings)
colors = ['yellowgreen', 'plum', 'gold', 'lightskyblue', 'lightcoral']
plt.pie(count, labels=rating, colors=colors, startangle=90,
        autopct="%1.1f%%")
plt.axis('equal')
plt.show()
```

![Breakdown of Ratings](/images/datacollection/ipynb-04.png)

Happy analyzing!
